Method for qualitative evaluation of a digital audio signal

ABSTRACT

The invention relates to a method of qualitatively evaluating a digital audio signal. It calculates a quality indicator consisting of a vector associated with each time window in real time, in continuous time, and in successive time windows. For example, the generation of a quality indicator vector calculates, for a reference audio signal and for an audio signal to be evaluated, the spectral power density of the audio signal, the coefficients of a prediction filter, using an autoregressive method, a temporal activity of the signal or the minimum value of the spectrum in successive blocks of the signal. To evaluate the deterioration of the audio signal, the method may calculate a distance between the vectors of the reference audio signal and the audio signal to be evaluated associated with each time window.

The present invention consists in a method of evaluating a digital audio signal, such as a signal transmitted digitally and/or a digital signal to which digital coding, in particular bit rate reduction coding, and/or decoding has been applied. A signal transmitted digitally may be an independent audio signal (as in the case of radio broadcasting) or an audio signal that accompanies a program such as an audiovisual program.

BACKGROUND OF THE INVENTION

The field of digital broadcasting and digital mobile radio is expanding fast, in particular following the introduction of digital television and mobile telephones. In order to be able to provide a quality assured service, new instruments need to be developed for measuring the quality of all the systems necessary for the deployment of this technology.

Subjective tests are used for this purpose that evaluate the quality of sound signals by having experts or novices listen to them. This method is time-consuming and costly, because many strict conditions must be complied with for such tests (choice of panelists, listening conditions, test sequences, test chronology, etc.). It nevertheless yields databases consisting of reference signals and the scores assigned to them. These tests yield Mean Opinion Scores (MOS) that are recognized as the benchmark in the area of quality estimation.

Many studies of the human hearing system have been carried out with the aim of minimizing the number of subjective tests. Based on this work, models of the ear and of psychoacoustic phenomena have been developed and have been used to analyze sound signals and to estimate their quality using objective methods. The quality measured is the quality as perceived by the human ear, and is therefore referred to as the objective perceived quality.

It is possible to distinguish three classes of objective test methods: the first of these classes is the “complete reference” class in which the original signal is compared directly with the degraded signal (i.e. the signal after coding, broadcasting, multiplexing, etc.); the second class is the “reduced reference” class in which only parameters extracted from the two signals are compared; in the third class, defects generated by the broadcasting system are detected using their known main characteristics, and this circumvents the constraints associated with the use of a reference signal (in all other cases, the reference must be transmitted to the place of comparison and then synchronized precisely with the degraded signal, which makes the system complex and more costly).

Degradation by transmission errors significantly reduces the quality of the signal and occurs when broadcasting an MPEG digital stream, for example, or when broadcasting via the Internet, especially in the case of radio broadcasts.

In this context, it is desirable to have a method of objectively measuring the quality of a broadcast audio signal either without using a reference signal at all or using a “reduced” reference signal, for example because only these methods are suitable for monitoring a broadcast network where a plurality of remote measuring points may be necessary. It is also beneficial to exploit the relative simplicity of this kind of method for measuring the quality of a digital audio signal that has been subjected to digital coding, in particular with bit rate reduction, and/or decoding, whether the signal has been transmitted or not.

The number of audio quality measuring methods that have been developed varies widely from one class to another. A large number of complete reference methods have been developed, but only a few reduced reference methods or methods that do not use a reference.

Complete reference methods, which compare the signal to be evaluated with a reference signal, comprise the standard techniques used to estimate the quality of radio coders, for example. Their general principle is to use a perceptual model of human hearing to calculate internal representations of the original signal and the degraded signal and then to compare these two internal representations. One example of a method of this kind is described in the paper by JOHN G. BEERENDS and JAN A. STEMERDINK, “A Perceptual Audio Quality Measure Based on a Psychoacoustic Sound Representation”, published in “Journal of the Audio Engineering Society”, Vol. 40, No. 12, December 1992, pages 963 to 978.

In order to obtain a representation that is as faithful as possible, these hearing models are based on masking experiments and must make it possible to predict whether the deterioration will be audible or not, since not all deterioration of a signal is audible or a nuisance. Perceptual models using a reference are based on the FIG. 1 diagram, and many methods of varying sophistication rely on this principle. The Perceptual Evaluation of Audio Quality (PEAQ) algorithm was recently standardized by the ITU-R in Recommendation BS.1387. This algorithm is based on the standard principles and combines them with a quality prediction model using a neural network.

Although it must be remembered that they were designed for evaluating the impact of coding, the major benefit of these techniques is the ability to detect very slight deterioration. The measurements obtained are relative in that only differences are taken into account in this type of measurement. In the case of a coder of very high quality, a seriously degraded signal will be coded and then decoded almost transparently, and a very high score will therefore be assigned. Moreover, the score could be low for a signal that has been modified (equalized, colored, etc.) between the step of calculating the reference and the comparison step, even if the perceived quality of the two signals is very high.

There are as yet few methods that do not use a reference. The Output-Based objective speech Quality (OBQ) method is the most highly developed of the “no reference” methods. It is a method of estimating the quality of a speech signal alone, with no reference signal, and is based on calculating perceptual parameters representing the content of the signal, combined into a vector. Vectors calculated for non-degraded signals constitute a reference database. Quality is estimated by comparing the same parameters obtained from degraded signals with vectors from the reference database. The main method using neural networks is the Objective Scaling of Sound Quality And Reproduction (OSSQAR) method. The general principle of this method is to use a hearing model and a neural network conjointly. To simulate psychoacoustic phenomena, the network predicts the subjective quality of the signal from a perceptual representation of the signal calculated using the hearing model. Note that the results obtained with these methods are much better if the signals are part of the training database, or at least if they have similar characteristics.

Thus these methods are not suitable for evaluating the quality of all signals, for example radio or TV broadcast audio signals.

As indicated above, most objective perceptual measurement algorithms using a complete reference operate in accordance with the same principle; they compare the degraded sound signal and the original signal (i.e. the signal before transmission and/or coding and/or decoding, called the reference signal). These algorithms therefore require a reference signal, which must additionally be synchronized very accurately with the signal under test. These conditions can only be satisfied in simulation or during tests on coders and other “compact” systems or systems that are not geographically distributed; in contrast, the situation is very different when receiving a signal broadcast from send antennas A₁ and receive antennas A₂ (see FIG. 2).

The reference signal must be available at the comparison points. The only option for using a complete reference method is to transmit the reference to the comparison points without errors and then to synchronize it perfectly. These complete reference methods are not applicable in practice, for reasons of spectral congestion, and therefore of cost, as they would necessitate the use of a transparent second transmission channel.

The methods with no reference that have been proposed may yield good results, but only with signals having known characteristics modeled during the training phase. Methods with no reference do not work well on any signal.

Using a “reduced” reference, in which the reference audio signal is characterized by one or more numbers, has been suggested. A method of this kind is described in French Patent Application FR 2 769 777 filed 13 Oct. 1997. However, this method is not able to process all the samples, in particular because the bit rate of the proposed reference signal (which is at least 36 kbit/s for windows comprising 1024 signal samples) is too high to satisfy the practical constraints on installation and implementation in a broadcast network.

OBJECTS AND SUMMARY OF THE INVENTION

The present invention proposes a method whereby the indicators are simpler and may be calculated in real time and in continuous time and require a much lower bit rate. The deterioration may modify only a few samples, even though it seriously degrades quality, and the proposed method enables the entire audio stream to be analyzed.

The method of the invention provides a reliable estimate of the quality of an audio signal that has been transmitted or coded digitally, since disturbances affecting the transmission channels may induce errors in the data transmitted that are reflected in a degraded final audio signal.

The technological approach proposed consists in effecting one measurement of the audio signal at the input of the system under test and another at the output. Comparing these measurements verifies that the transmission channel is “transparent” and evaluates the magnitude of the deterioration that has been introduced.

By detecting deterioration on the basis of the signatures of the characteristics of the more serious defects to be identified, the proposed approach reliably estimates the deterioration introduced, whether it is used in conjunction with methods that use no reference or not. It further alleviates the lack of a reference signal. In the case of reduced reference measurements, this method reduces the reference bit rate necessary for estimating quality, and in the case of measurements with no reference it reduces the number of parameters that have to be used.

Thus the invention provides a method of evaluating a digital audio signal, comprising calculating, in real time, in continuous time, and in successive time windows, a quality indicator which consists, for each time window, of a vector whose dimension is advantageously at least one hundred times smaller than the number of audio samples in a time window. This dimension is from 1 to 10, for example, and preferably from 1 to 5.

The digital audio signal to be evaluated may have been transmitted digitally and/or subjected to digital coding, in particular with bit rate reduction, starting from a reference digital signal.

In a first variant, using a perceptual count difference, the generation of a quality indicator vector employs the following steps for a reference audio signal and for the audio signal to be evaluated:

a) calculating for each time window the spectral power density of the audio signal and applying to it a filter representative of the attenuation of the inner and middle ear to obtain a filtered spectral density,

b) calculating individual excitations from the filtered spectral density using the frequency spreading function of the basilar scale,

c) determining the compressed loudness from said individual excitations using a function modeling the non-linear frequency sensitivity of the ear, to obtain basilar components,

d) separating the basilar components into classes, preferably into three classes, and calculating for each class a number C representing the sum of the frequencies of that class, said vector consisting of said numbers C, and

e) calculating a distance between the vectors of the reference audio signal and the audio signal to be evaluated associated with each time window to evaluate the deterioration of the audio signal.

In a second variant, using autoregressive modeling of the audio signal, the generation of a quality indicator vector employs the following steps for the reference audio signal and for the audio signal to be evaluated:

a) calculating N coefficients of a prediction filter by autoregressive modeling,

b) determining in each time window the maximum of the prediction residue as a difference between the signal predicted with the aid of the prediction filter and the audio signal, said maximum of the prediction residue constituting said quality indicator vector, and

c) calculating a distance between said vectors of the reference audio signal and the audio signal to be evaluated associated with each time window to evaluate the deterioration of the audio signal.

In a third variant, using autoregressive modeling of the basilar excitation, the generation of a quality indicator vector employs the following steps for the reference audio signal and for the audio signal to be evaluated:

a) calculating for each time window the spectral power density of the audio signal and applying to it a filter representative of the attenuation of the inner and middle ear to obtain a frequency spreading function in the basilar scale,

b) calculating individual excitations from the frequency spreading function in the basilar scale,

c) obtaining the compressed loudness from said individual excitations using a function modeling the non-linear frequency sensitivity of the ear, to obtain basilar components,

d) calculating N′ prediction coefficients of a prediction filter from said basilar components by autoregressive modeling, and

e) generating for each time window a quality indicator vector from only some of the N′ prediction coefficients.

The quality indicator vector preferably comprises from 5 to 10 of said prediction coefficients.

In a fourth variant, using detection of flats in the activity of the signal, the generation of a quality indicator vector employs the following steps for at least the audio signal to be evaluated:

a) calculating a temporal activity of the signal in each time window,

b) calculating a sliding average over N₁ successive values of the temporal activity, and

c) retaining the minimum value of M₁ successive values of the sliding average.

The quality indicator vector may consist of said minimum value, or a binary value that is the result of comparing said minimum value with a given threshold. The method may equally calculate a quality score by determining a cumulative time interval during which said minimum value is below a given threshold S₁ and/or by determining the number of times per second said minimum value is below a given threshold S′₁, or said minimum values are generated at the same time for the reference audio signal and for the audio signal to be evaluated and a quality vector is generated by comparing the corresponding minimum values for the reference audio signal and for the audio signal to be evaluated, for example by calculating the difference or the ratio between said minimum values.
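By way of illustration only, steps a) to c) of the fourth variant may be sketched as follows. The activity measure (mean absolute difference between consecutive samples), the use of non-overlapping groups of M₁ values, and all function names are assumptions made for this sketch; the text above does not fix them.

```python
def temporal_activity(window):
    """Assumed activity measure: mean absolute difference of consecutive samples."""
    if len(window) < 2:
        return 0.0
    steps = [abs(b - a) for a, b in zip(window, window[1:])]
    return sum(steps) / len(steps)


def sliding_average(values, n1):
    """Average over every run of n1 successive values (step b)."""
    return [sum(values[i:i + n1]) / n1 for i in range(len(values) - n1 + 1)]


def block_minima(values, m1):
    """Minimum of each group of m1 successive values (step c).

    Non-overlapping groups are an assumption of this sketch.
    """
    return [min(values[i:i + m1]) for i in range(0, len(values) - m1 + 1, m1)]


def flat_indicator(activities, n1, m1, threshold):
    """Binary indicator: 1 where the retained minimum falls below the threshold."""
    minima = block_minima(sliding_average(activities, n1), m1)
    return [1 if m < threshold else 0 for m in minima]


# Activity values with a silent stretch in the middle.
activities = [4, 4, 4, 4, 0, 0, 0, 0, 4, 4, 4, 4]
minima = block_minima(sliding_average(activities, 2), 3)
flags = flat_indicator(activities, 2, 3, threshold=1)
```

The same structure, with `max` in place of `min`, would give a comparable sketch for peak detection.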

In a fifth variant, using detection of peaks in the activity of the audio signal, the generation of a quality indicator vector employs the following steps for at least the audio signal to be evaluated:

a) calculating a temporal activity of the signal in each time window,

b) calculating a sliding average over N₂ successive values of the temporal activity, and

c) retaining the maximum value from M₂ successive values of the sliding average.

The quality indicator vector may consist of said maximum value or a binary value resulting from comparing said maximum value with a given threshold.

In the method, a deterioration indicator may be generated by comparing the maximum value obtained for the reference audio signal and the corresponding maximum value obtained for the audio signal to be evaluated, for example by calculating the difference or the ratio between the maximum values.

In a sixth variant, using calculation of the minimum of the spectrum of the audio signal, the generation of a quality indicator vector calculates, at least for the audio signal to be evaluated, the Fourier transform in successive blocks of N₃ samples constituting said time windows and the minimum of the spectrum in M₃ successive blocks that constitute a quality indicator vector.

The method may include a step of evaluating the introduction of noise into the audio signal to be evaluated by comparing the value of said minimum value of the spectrum in M₃ successive blocks associated with the audio signal to be evaluated and the maximum value of the M₃ minima obtained in the same M₃ successive blocks associated with the reference audio signal.

It may equally include a step of evaluating the introduction of noise into the audio signal to be evaluated by comparing the value of said minimum of the spectrum in M₃ successive blocks with an average value of the minima of the spectrum obtained in blocks anterior to the M₃ successive blocks, for example by calculating the difference or the ratio between the average values.
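The sixth variant may be sketched roughly as below. The naive discrete Fourier transform, the normalization of the power spectrum by the block length, the use of a simple difference as the comparison, and the function names are all assumptions of this sketch, not details given in the text.

```python
import cmath
import math


def power_spectrum(block):
    """Power spectrum of one block via a naive DFT (bins 0 .. N/2)."""
    n = len(block)
    spectrum = []
    for k in range(n // 2 + 1):
        coeff = sum(block[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n))
        spectrum.append(abs(coeff) ** 2 / n)
    return spectrum


def block_spectrum_minima(blocks):
    """Minimum of the spectrum in each of the successive blocks."""
    return [min(power_spectrum(block)) for block in blocks]


def noise_margin(evaluated_blocks, reference_blocks):
    """Assumed comparison: minimum over the evaluated blocks minus the
    maximum of the minima over the same blocks of the reference."""
    evaluated_min = min(block_spectrum_minima(evaluated_blocks))
    reference_max = max(block_spectrum_minima(reference_blocks))
    return evaluated_min - reference_max


# A pure tone has empty spectral bins; an impulse raises the spectral floor.
tone = [0.0, 1.0, 0.0, -1.0]
impulse = [1.0, 0.0, 0.0, 0.0]
margin = noise_margin([impulse], [tone])
```

A positive margin in this sketch suggests that the floor of the spectrum has risen relative to the reference, which is the situation the comparison is meant to reveal.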

In a seventh variant, using estimation of the flattening of the spectrum of the audio signal, the generation of a quality indicator vector calculates, at least for the audio signal to be evaluated, a spectrum flattening parameter that is the ratio between an arithmetical mean and a geometrical mean of the components of the spectrum of the signal.

The method may then use an indicator of detection of deterioration of the audio signal by the introduction of wideband noise by comparing said spectrum flattening parameter between the reference audio signal and the audio signal to be evaluated, for example by calculating the difference or the ratio between the two parameters.
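The spectrum flattening parameter of the seventh variant may be illustrated by the following sketch. The small floor applied to the spectral components (to keep the logarithm defined) and the function name are assumptions of the sketch.

```python
import math


def spectrum_flattening(power_spectrum, floor=1e-12):
    """Ratio of the arithmetic mean to the geometric mean of the components.

    The ratio is 1 for a perfectly flat spectrum and grows as the
    spectrum becomes more peaked. The floor is an assumption that
    avoids taking the logarithm of zero.
    """
    values = [max(p, floor) for p in power_spectrum]
    arithmetic = sum(values) / len(values)
    geometric = math.exp(sum(math.log(v) for v in values) / len(values))
    return arithmetic / geometric


flat = spectrum_flattening([2.0, 2.0, 2.0, 2.0])
peaked = spectrum_flattening([1.0, 4.0])
```

Wideband noise tends to fill the gaps in the spectrum, so the parameter of a noisy signal would be expected to move towards 1 relative to that of the reference.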

BRIEF DESCRIPTION OF THE DRAWINGS

Other features and advantages of the invention will become more clearly apparent on reading the following description, which is given with reference to the drawings, in which:

FIG. 1 is a flowchart showing a complete reference quality evaluation process,

FIG. 2 depicts audio transmission with loss of quality,

FIGS. 3 to 10 represent evaluation methods of the present invention, and

FIGS. 11 and 12 represent an audio quality measuring system of the present invention.

MORE DETAILED DESCRIPTION

The management and recovery of decoding errors are not standardized. The impact of these errors on perceived quality therefore depends on the decoder used.

The audibility of these defects is also related to the type of elements in the frame affected, for example MPEG elements, and its audio content.

In the case of serious transmission errors, signal quality is greatly degraded. This degradation occurs during the broadcasting of an MPEG digital stream, for example, and is usually impulsive. It may also occur when broadcasting an audio stream over the Internet or during coding or decoding.

For this type of defect, quality may be estimated in a binary fashion; either the signal has not been degraded, and its quality depends on the initial coding used, or errors have been introduced, and the signal has been seriously degraded.

Quality may then be estimated using methods that use no reference, by calculating the deterioration detected at regular time intervals of the order of one second, for example. Subjective tests have yielded a reliable estimate of perceived quality based on the number and length of interruptions related to an impulsively degraded signal.

The reduced reference measurement method proposed reduces the bit rate necessary for conveying the reference. This permits the use of channels reserved for a relatively limited bit rate. These measurements are used to detect forms of deterioration other than that caused by transmission errors.

Thus the present invention provides bit rate reduction in the case of reduced reference measurements and, by adding simple measurements with no reference, retains measurement of serious deterioration in the event of loss of the reference, for example, by locally generating a vector that simply characterizes the deterioration and which can therefore be easily processed and transmitted to a control installation, in particular to a centralized installation.

The measurements effected along the system and at various points of the network inform the digital television broadcasting monitoring and management system of the overall performance of the network. The measured signal deterioration informs the broadcast operator of the quality of service delivered.

The method is characterized by two complementary modes of operation:

Reduced reference mode: The technological approach proposed consists in effecting one measurement on the audio signal at the input of the transmission system or other system under test (coder, decoder, etc.) and another at the output. Comparing these measurements verifies the “transparency” of the system and evaluates the magnitude of the deterioration that has been introduced. Unlike the prior art technique:

-   the evaluation is in real time and in continuous time,
-   the reference measurements at the input of the system represent a very small quantity of data relative to the data of the audio signal, which explains the designation “reduced reference”, and
-   the reference data or measurements used are also a reduced representation of the content of the signal as well as a measurement of the magnitude of a type of deterioration.

The invention alleviates the lack of a reference signal. To this end, the method defines measurements for the characteristic digital defects to be identified. Unlike the prior art, the approach proposed is able to estimate the deterioration of any signal reliably, and this approach may be applied equally well at the level of an entire transmission network or locally at the level of an equipment. Moreover, the complexity of the calculations for this method is low, and the indicator obtained represents a small quantity of data compared to the digital audio stream.

Finally, the method may be applied indifferently to purely digital signals and to signals that have been subjected to digital-to-analogue conversion followed by analogue-to-digital conversion after transmission.

The first three methods described hereinafter are “reduced reference” methods.

To obtain a more accurate quality estimate, certain of the parameters developed use perceptual modeling; the theory of objective perceptual measurements is based on the transformation of a physical representation (sound pressure level, level, time, and frequency) into a psychoacoustic representation (sound strength, masking level, critical times and bands or barks) of two signals (the reference signal and the signal to be evaluated), in order to compare them. This conversion is effected by means of a model of the human hearing apparatus (this modeling generally consists in a spectrum analysis of barks followed by spreading phenomena). A distance between the psychoacoustic representations of the two signals may then be calculated, and may be related to the quality of the signal to be evaluated (the shorter the distance, the closer the signal to be evaluated to the original signal and the better its quality).

The first method uses a “perceptual counting error” parameter.

To take account of psychoacoustic factors, this parameter is calculated in several steps. These steps are applied to the reference signal and to the degraded signal. They are as follows:

Time windowing of the signal in blocks and then, for each of the blocks, calculating the excitation induced by the signal, using a hearing model. This representation of the signals takes account of psychoacoustic phenomena and generates a histogram whose counts are the values of the basilar components. This limits the amount of useful information by ignoring everything except the audio components of the signal. To obtain this excitation, standard modeling techniques may be used, such as attenuation of the external and middle ear, integration in critical bands, and frequency masking. The time windows chosen are of approximately 42 ms duration (2048 points at 48 kHz), with a 50% overlap. This achieves a time resolution of the order of 21 ms.
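The windowing figures quoted above can be checked with a short sketch. The function name and the choice to drop an incomplete final block are assumptions of the sketch.

```python
def frame_signal(samples, window=2048, overlap=0.5):
    """Split a sequence into overlapping blocks; an incomplete tail is dropped."""
    hop = int(window * (1.0 - overlap))
    return [samples[start:start + window]
            for start in range(0, len(samples) - window + 1, hop)]


SAMPLE_RATE_HZ = 48_000
window_ms = 1000.0 * 2048 / SAMPLE_RATE_HZ   # duration of one block, in ms
hop_ms = window_ms * 0.5                     # time resolution with 50% overlap

# 4096 samples yield three blocks starting at samples 0, 1024 and 2048.
blocks = frame_signal(list(range(4096)))
```

With these values a block lasts about 42.7 ms and a new block starts every 1024 samples, i.e. about every 21.3 ms.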

This modeling requires several steps. For the first step, the external and middle ear attenuation filter is applied to the spectral power density obtained from the spectrum of the signal. This filter also takes into account the absolute hearing threshold. The concept of critical bands is modeled by converting from a frequency scale to a basilar scale. The next step corresponds to calculating individual excitations to take account of masking phenomena, using the frequency spreading function of the basilar scale and non-linear addition. By means of a power function, the last step yields the compressed loudness, used for modeling the non-linear frequency sensitivity of the ear by means of a histogram comprising the 109 basilar components.

The counts of the histogram obtained are then grouped into three classes. This vectorization yields a visual representation of the evolution of the structure of the signals and a simple and concise characterization of the signal and thus a reference parameter that is of particular benefit.

There are several strategies for fixing the boundaries of these three classes; the simplest separates the histogram into three areas of equal size. Thus the 109 basilar components (or the 24 components that constitute the excitation and provide a simplified representation of it) represent 24 Barks and may be separated at the following indices:

$\begin{matrix} S_{1} = 36, & \text{i.e. } z = \frac{24}{109} \times 36 = 7.927\ \text{Barks} & (1) \\ S_{2} = 73, & \text{i.e. } z = \frac{24}{109} \times 73 = 16.073\ \text{Barks} & (2) \end{matrix}$

The second strategy takes into account the Beerends scaling areas. In fact, the gain between the excitation of the reference signal and that of the signal under test is compensated by ear. The limits set are then as follows:

$\begin{matrix} S_{1} = 9, & \text{i.e. } z = \frac{24}{109} \times 9 = 1.982\ \text{Barks} & (3) \\ S_{2} = 100, & \text{i.e. } z = \frac{24}{109} \times 100 = 22.018\ \text{Barks} & (4) \end{matrix}$
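The conversion from component index to Barks in equations (1) to (4), and the grouping of the 109 components into three classes, may be sketched as follows. The function names and the exact treatment of the boundary indices (component S₁ assigned to the medium class, component S₂ to the high class) are assumptions of the sketch.

```python
BARK_RANGE = 24.0      # the components span 24 Barks
N_COMPONENTS = 109     # number of basilar components


def index_to_bark(index):
    """Position in Barks of a component index, as in equations (1) to (4)."""
    return BARK_RANGE / N_COMPONENTS * index


def class_counts(components, s1, s2):
    """Return (c1, c2, c3): sums of the high, medium and low components.

    Which class the boundary components fall into is an assumption.
    """
    c3 = sum(components[:s1])      # low frequencies, below s1
    c2 = sum(components[s1:s2])    # medium frequencies, between s1 and s2
    c1 = sum(components[s2:])      # high frequencies, above s2
    return c1, c2, c3


# Equal-size strategy applied to a uniform histogram.
counts = class_counts([1.0] * N_COMPONENTS, 36, 73)
```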

The trajectory is then represented in a triangle called the triangle of frequencies. Three counts C₁, C₂ and C₃ are obtained for each block, and therefore two Cartesian coordinates, satisfying the following equations:

$\begin{matrix} X = \frac{C_{1}}{N} + \frac{1}{2} \cdot \frac{C_{2}}{N} & (5) \\ Y = \frac{C_{2}}{N} \sin\left( \frac{\pi}{3} \right) & (6) \end{matrix}$

in which:

C₁ is the sum of the basilar excitations for the high frequencies (components above S₂),

C₂ is the count associated with the medium frequencies (components between S₁ and S₂),

C₃ is the count associated with the low frequencies (components below S₁), and

N=C₁+C₂+C₃ is the total sum of the values of the components.
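Equations (5) and (6) may be illustrated by the sketch below, which maps the three counts of a block to a point of the triangle. The function name and the handling of a silent block (N = 0) are assumptions of the sketch.

```python
import math


def triangle_point(c1, c2, c3):
    """Cartesian coordinates (X, Y) of a block in the triangle of frequencies."""
    n = c1 + c2 + c3
    if n == 0:
        return None  # silent block: returning no point is an assumption
    x = c1 / n + (c2 / n) / 2.0               # equation (5)
    y = (c2 / n) * math.sin(math.pi / 3.0)    # equation (6)
    return x, y


# The three pure cases land on the vertices of an equilateral triangle.
high_only = triangle_point(1.0, 0.0, 0.0)
medium_only = triangle_point(0.0, 1.0, 0.0)
low_only = triangle_point(0.0, 0.0, 1.0)
```

A block whose energy is entirely in one class maps to a vertex, and any mixture maps to a point inside the triangle.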

A point (X, Y) constituting a vector is therefore obtained for each time window of the signal, which corresponds to the transmission of two values per window of 1024 samples, for example, i.e. a bit rate of 3 kbit/s for an audio signal sampled at 48 kHz. The representation for a complete sequence is therefore a trajectory parameterized by time, as shown in FIG. 3.
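The 3 kbit/s figure can be reproduced with the arithmetic below, on the assumption that each value is coded on 32 bits; the text does not state the word length, so that parameter and the function name are illustrative only.

```python
def reference_bit_rate(sample_rate_hz, samples_per_window,
                       values_per_window, bits_per_value=32):
    """Bit rate, in bit/s, needed to convey the reference vector.

    bits_per_value=32 is an assumption chosen to match the quoted figure.
    """
    windows_per_second = sample_rate_hz / samples_per_window
    return windows_per_second * values_per_window * bits_per_value


# Two values (X, Y) for every 1024 samples at 48 kHz.
rate_xy = reference_bit_rate(48_000, 1024, 2)
```

Under the same assumption, a reference reduced to a single value per 1024 samples would need half that rate.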

The Euclidean distance between the reference signal and the degraded signal is then calculated. In the case of continuous estimation of quality, the distance between the points provides an estimate of the magnitude of the deterioration introduced between the reference signal and the degraded signal. Because psychoacoustic models are used, this distance may be regarded as a perceived distance.
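A minimal sketch of this window-by-window comparison is given below. It assumes that the two trajectories have already been synchronized and hold one (X, Y) point per time window; the function name is hypothetical.

```python
import math


def trajectory_distances(reference, degraded):
    """Euclidean distance between corresponding points of two trajectories."""
    if len(reference) != len(degraded):
        raise ValueError("trajectories must be synchronized and of equal length")
    return [math.hypot(xr - xd, yr - yd)
            for (xr, yr), (xd, yd) in zip(reference, degraded)]


distances = trajectory_distances([(0.0, 0.0), (1.0, 1.0)],
                                 [(3.0, 4.0), (1.0, 1.0)])
```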

To estimate a quality score for a signal of several seconds duration, it is possible to calculate a global measurement of the difference between the two signals. Several metrics can be used for this. They may be of the diffuse type (average distance between peaks, intercepted area, etc.) or the local type (maximum and minimum distances between peaks, etc.), and depend on the position within the triangle.

It is also possible to take account of just noticeable differences. These are thresholds that determine the audibility of the differences that have occurred. To take account of the variability of the masking phenomena, they may be modeled by tolerance areas as a function of position in the triangle.

In all cases, the two trajectories must be synchronized first.

Thus the principle of calculating this comparative parameter may be summarized in the manner of the FIG. 4 diagram.

The main advantage of this parameter is that it takes account of psychoacoustic phenomena without increasing the bit rate necessary to transfer the reference. In this way the reference for 1024 signal samples may be reduced to two values (3 kbit/s).

The second method uses autoregressive modeling of the signal.

The general principle of linear prediction is to model a signal as a combination of its past values. The basic idea is to calculate the N coefficients of a prediction filter by autoregressive (all pole) modeling. It is possible to obtain a predicted signal from the real signal using this adaptive filter. The prediction or residual errors are calculated from the difference between these two signals. The presence and the quantity of noise in a signal may be determined by analyzing these residues.

The magnitude of the modifications and defects introduced may be estimated by comparing the residues obtained for the reference signal and those calculated from the degraded signal.

Because there is no benefit in transmitting all of the residues if the bit rate of the reference is to be reduced, the reference to be transmitted corresponds to the maximum of the residues over a time window of given size.

Two methods of adapting the coefficients of the prediction filter are described hereinafter by way of example:

-   The LEVINSON-DURBIN algorithm, which is described, for example, in “Traitement numérique du signal—Théorie et pratique” [“Digital signal processing—Theory and practice”] by M. BELLANGER, MASSON, 1987, pp. 393 to 395. To use this algorithm, an estimate is required of the autocorrelation of the signal over a set of N₀ samples. This autocorrelation is used to solve the Yule-Walker system of equations and thus to obtain the coefficients of the prediction filter. Only the first N values of the autocorrelation function may be used, where N designates the order of the algorithm, i.e. the number of coefficients of the filter. The maximum prediction error is retained over a window comprising 1024 samples.
-   The gradient algorithm, which is also described in the above-mentioned book by M. BELLANGER, for example, starting at page 371. The main drawback of the preceding parameter is the necessity, in the case of a DSP implementation, to store the N₀ samples in order to estimate the autocorrelation, together with the coefficients of the filter, and then to calculate the residues. The second parameter avoids this by using another algorithm to calculate the coefficients of the filter, namely the gradient algorithm, which uses the error that has occurred to update the coefficients. The coefficients of the filter are modified in the direction of the gradient of the instantaneous quadratic error, with the opposite sign.
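As an illustration of the first of these algorithms, the sketch below estimates the autocorrelation and solves the Yule-Walker equations with the textbook Levinson-Durbin recursion. It is a generic formulation, not taken from the cited book; the function names, the biased autocorrelation estimate without normalization, and the use of lags 0 to N are assumptions of the sketch.

```python
def autocorrelation(x, max_lag):
    """Autocorrelation estimate of x for lags 0 .. max_lag (no normalization)."""
    n = len(x)
    return [sum(x[i] * x[i + lag] for i in range(n - lag))
            for lag in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Solve the Yule-Walker equations for a predictor of the given order.

    r holds the autocorrelation for lags 0 .. order. Returns the
    coefficients a[1..order] of the predictor
    x_hat[n] = a[1]*x[n-1] + ... + a[order]*x[n-order]
    and the final prediction error power.
    """
    a = [0.0] * (order + 1)
    error = r[0]
    if error == 0:
        return a[1:], 0.0  # silent input: nothing to predict
    for i in range(1, order + 1):
        acc = r[i] - sum(a[j] * r[i - j] for j in range(1, i))
        k = acc / error                      # reflection coefficient
        updated = a[:]
        updated[i] = k
        for j in range(1, i):
            updated[j] = a[j] - k * a[i - j]
        a = updated
        error *= (1.0 - k * k)
    return a[1:], error


# Autocorrelation of a first-order process with coefficient 0.5.
coeffs, err = levinson_durbin([1.0, 0.5, 0.25], order=2)
```

For this input the recursion recovers a first coefficient of 0.5 and a second coefficient of zero, as expected for a first-order process.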

When the residues have been obtained from the difference between the predicted signal and the real signal, only the maximum of their absolute values over a time window of given size T is retained. The reference vector to be transmitted can therefore be reduced to a single number.

After transmission followed by synchronization, comparison consists in simply calculating the distance between the maxima of the reference and the degraded signal, for example using a difference method.
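The gradient variant and the comparison step can be sketched in the same spirit. The step size `mu`, the filter order and the function names are hypothetical choices for illustration, and the sketch assumes samples of roughly unit amplitude so that the fixed step size remains stable.

```python
import numpy as np

def lms_max_residual(x, order=10, mu=0.01):
    """Adapt the prediction coefficients sample by sample with a gradient
    (LMS-style) update and return the maximum absolute prediction error.

    order and mu are illustrative assumptions, not values from the text.
    """
    x = np.asarray(x, dtype=float)
    w = np.zeros(order)
    max_err = 0.0
    for n in range(order, len(x)):
        past = x[n - order:n][::-1]     # x[n-1], ..., x[n-order]
        e = x[n] - np.dot(w, past)      # instantaneous prediction error
        w += mu * e * past              # step opposite to the gradient of e**2
        max_err = max(max_err, abs(e))
    return max_err

def reference_distance(max_ref, max_deg):
    """Difference-type distance between the two transmitted maxima."""
    return abs(max_ref - max_deg)
```

Unlike the autocorrelation approach, this loop needs no buffer of N₀ samples, only the last `order` samples and the current coefficients.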

FIG. 5 summarizes the parameter calculation principle:

The main advantage of these two parameters is the low bit rate necessary for transferring the reference, which is reduced to one real number per 1024 signal samples.

However, no account is taken of any psychoacoustic model.

The third method uses autoregressive modeling of the basilar excitation.

In contrast to the standard linear prediction method, this method takes account of psychoacoustic phenomena in order to obtain an evaluation of perceived quality. For this purpose, calculating the parameter entails modeling diverse hearing principles. Linear prediction models the signal as a combination of its past values. Analysis of the residues (or prediction errors) determines the presence of noise in a signal and estimates the noise. The major drawback of these techniques is that they take no account of psychoacoustic principles. Thus it is not possible to estimate the quantity of noise actually perceived.

The method uses the same general principle as standard linear prediction and additionally integrates psychoacoustic phenomena in order to adapt to the non-linear sensitivity of the human ear in terms of frequency (pitch) and intensity (loudness).

The spectrum of the signal is modified by means of a hearing model before calculating the linear prediction coefficients by autoregressive (all pole) modeling. The coefficients obtained in this way provide a simple way to model the signal taking account of psychoacoustics. It is these prediction coefficients that are sent and used as a reference for comparison with the degraded signal.

The first part of the calculation of this parameter models psychoacoustic principles using the standard hearing models. The second part calculates linear prediction coefficients. The final part compares the prediction coefficients calculated for the reference signal and those obtained from the degraded signal. The various steps of this method are therefore as follows:

-   Time windowing of the signal followed by calculation of an internal representation of the signal by modeling psychoacoustic phenomena. This step corresponds to the calculation of the compressed loudness, which is in fact the excitation in the inner ear induced by the signal. This representation of the signal takes account of psychoacoustic phenomena and is obtained from the spectrum of the signal, using the standard form of modeling: attenuation of the external and middle ear, integration in critical bands, and frequency masking; this step of the calculation is identical to the parameter described above;

-   Autoregressive modeling of the compressed loudness in order to obtain the coefficients of an FIR prediction filter, exactly as in standard linear prediction; the method used is that of autocorrelation by solving the Yule-Walker equations; the first step for obtaining the prediction coefficients is therefore calculating the autocorrelation of the signal.

It is possible to calculate the perceived autocorrelation of the signal using an inverse Fourier transform by considering the compressed loudness as a filtered spectral power.

One method of solving the Yule-Walker system of equations and thus of obtaining the coefficients of a prediction filter uses the Levinson-Durbin algorithm.

It is the prediction coefficients that constitute the reference vector to be sent to the comparison point. The transforms used for the final calculations on the degraded signal are the same as are used for the initial calculations applied to the reference signal.

-   Estimating the deterioration by calculating a distance between the vectors from the reference and from the degraded signal. This compares coefficient vectors obtained for the reference and for the transmitted audio signal, enabling the deterioration caused by transmission to be estimated, using an appropriate number of coefficients. The higher this number, the more accurate the calculations, but the greater the bit rate necessary for transmitting the reference. A plurality of distances may be used to compare the coefficient vectors. The relative size of the coefficients may be taken into account, for example.

The principle of the method may be as summarized in the FIG. 6 diagram.

Modeling psychoacoustic phenomena yields 24 basilar components. The order N of the prediction filter is 32. From these components, 32 autocorrelation coefficients are estimated, yielding 32 prediction coefficients, of which only 5 to 10 are retained as a quality indicator vector, for example the first 5 to 10 coefficients.
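The autoregressive step can be illustrated with the following sketch, which takes an already computed vector of compressed loudness values as input (the hearing model itself is not reproduced). Several points are assumptions made for the example: the function names, the direct solution of the Yule-Walker system with a general linear solver instead of the Levinson-Durbin recursion, the Euclidean distance, and a reduced order of 10, since the way 32 autocorrelation lags are derived from 24 components is not detailed in the text.

```python
import numpy as np

def perceptual_lpc(loudness, order=10):
    """Prediction coefficients a_1..a_order of an all-pole model fitted to
    the compressed loudness, treated as a (filtered) power spectrum.

    order must be smaller than len(loudness); the value 10 is an assumption.
    """
    loudness = np.asarray(loudness, dtype=float)
    n_lags = 2 * (len(loudness) - 1)
    # Autocorrelation = inverse Fourier transform of the power spectrum.
    r = np.fft.irfft(loudness, n=n_lags)[:order + 1]
    # Yule-Walker system R a = -r[1:], with R Toeplitz built from r[0..order-1].
    idx = np.abs(np.subtract.outer(np.arange(order), np.arange(order)))
    return np.linalg.solve(r[idx], -r[1:order + 1])

def coefficient_distance(a_ref, a_deg, n_keep=5):
    """Euclidean distance over the first n_keep coefficients (one of several
    possible distances; chosen here only for illustration)."""
    a_ref = np.asarray(a_ref, dtype=float)[:n_keep]
    a_deg = np.asarray(a_deg, dtype=float)[:n_keep]
    return float(np.linalg.norm(a_ref - a_deg))
```

Only the first `n_keep` coefficients would need to be transmitted as the reference for each window.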

The main advantage of this parameter is that it takes account of psychoacoustic phenomena. To this end, it has been necessary to increase the bit rate needed to transfer the reference, which consists of 5 to 10 values for 1024 signal samples (21 ms for an audio signal sampled at 48 kHz), that is to say a bit rate of 7.5 to 15 kbit/s.

The following methods may be used with or without a reference. This means that the measurements for detecting more serious deterioration are retained, even if no reference parameter is available at the control point at the time when the comparison must be effected.

The first of these methods uses detection of flats in the activity of the signal.

The notion of activity, which may be approximated by differentiating the audio signal, is used to identify breaks and interruptions in the temporal signal.

These types of error are characteristic of coding errors after transmitting a digital audio stream or broadcasting sound sequences over the Internet. They occur when the bit rate of the network is too low to ensure the arrival of all the necessary frames by the time for decoding, for example.

These forms of deterioration, which introduce areas of very low activity, are reflected in different auditory sensations for the hearer: breaks in the sound, blurred sound, impulsive noise, etc.

The first step of calculating the parameter is estimating the temporal activity of the signal. To this end, a second derivative operator is used. It provides a sufficiently precise estimate of activity and requires only a very few calculations.

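The following formula, in which f(x₀) corresponds to the value of the sample at time x₀, is a simple way to simulate this second derivative operator:

$\begin{matrix}{f''(x_{0}) = f(x_{0} + 2) - 2 \cdot f(x_{0}) + f(x_{0} - 2)} & (7)\end{matrix}$

or

$\begin{matrix}{f''(x_{0}) = f(x_{0} + 1) - 2 \cdot f(x_{0}) + f(x_{0} - 1)} & (8)\end{matrix}$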

A sliding average over N values is then used to smooth the variations in the curve obtained and thus to prevent false detection (for example N=21, which corresponds to roughly 0.5 ms for a sampling frequency of 48 kHz). Only one result is retained per block of M results (M corresponds to 2048 audio samples, for example). The minimum of the M averages is retained and transmitted. The parameter is therefore obtained at time t from the following formula, in which y(t) corresponds to the activity:

$\begin{matrix}{{{Flats}(t)} = {\min\limits_{k \in M}\left( {\frac{1}{N}{\sum\limits_{i \in N}{y\left( {t - k - i} \right)}}} \right)}} & (9)\end{matrix}$

If the parameter is used with a reference, after synchronizing the data, the comparison step is a simple difference operation that identifies areas in which the signal has been replaced by decoding flats. Only times at which the activity of the degraded signal is greatly reduced are of interest. Thus the comparison formula is as follows, where Flats_(r)(t) and Flats_(d)(t) are respectively the parameter calculated for the reference and the parameter calculated for the degraded signal:

$\begin{matrix}{{d(t)} = {\max\left( {0,{{{Flats}_{r}(t)} - {{Flats}_{d}(t)}}} \right)}} & (10)\end{matrix}$
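The calculation chain for this parameter (activity, sliding average, block minimum, comparison) can be sketched as follows. The function names are hypothetical, the second-difference form of equation (8) is used, and the activity is taken as the absolute value of the second difference, which is an assumption of this sketch.

```python
import numpy as np

def activity(x):
    """Activity estimated as the magnitude of the second difference,
    following equation (8); the two edge samples are dropped."""
    x = np.asarray(x, dtype=float)
    return np.abs(x[2:] - 2.0 * x[1:-1] + x[:-2])

def flats(x, n_avg=21, block=2048):
    """One value per block: the minimum of the sliding averages (N = n_avg)
    of the activity over `block` consecutive results."""
    y = activity(x)
    smoothed = np.convolve(y, np.ones(n_avg) / n_avg, mode="valid")
    n_blocks = len(smoothed) // block
    return smoothed[:n_blocks * block].reshape(n_blocks, block).min(axis=1)

def flats_distance(flats_ref, flats_deg):
    """Comparison step: only drops in activity of the degraded signal count."""
    diff = np.asarray(flats_ref, dtype=float) - np.asarray(flats_deg, dtype=float)
    return np.maximum(0.0, diff)
```

A block of the degraded signal that contains a run of constant samples yields a value close to zero, so the distance becomes positive for that block only.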

To reduce further the bit rate necessary for transporting the reference, it is also possible to compare the parameter Flats(t) calculated from the signal with a threshold S and thus to obtain a binary parameter. The drop in activity in the event of deterioration is in fact sufficiently great to be detected in this way.

In this case, comparison serves only to confirm the presence of deterioration. Thus no confusion is possible between areas of silence and areas of weak activity of the signal. Using the parameter with no reference nevertheless identifies the deterioration.

The psychoacoustic magnitude of the deterioration detected must be analyzed to proceed from detecting deterioration to estimating a perceived quality score. The perceived deterioration may vary greatly according to its length and the number of occurrences.

The next step therefore uses correspondence curves based on the binary parameter. These curves yield a quality score from the cumulative length of the impulsive deterioration and the number detected per second. These curves are established from subjective tests. Different curves may be established as a function of the audio signal type (mainly speech or music). Once the estimate has been obtained, it is equally possible to use a filter for simulating the response of a panel member. This takes account of the dynamic effect of the votes and the time to react to the deterioration.
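The two quantities fed to the correspondence curves can be derived from the binary parameter as in the sketch below. The function name and the convention of one parameter value per block of fixed duration are assumptions; the correspondence curves themselves come from subjective tests and are not reproduced here.

```python
import numpy as np

def flat_statistics(flats_values, threshold, block_duration_s):
    """Return (cumulative duration in seconds of blocks below the threshold,
    number of distinct below-threshold events per second)."""
    below = np.asarray(flats_values, dtype=float) < threshold
    if below.size == 0:
        return 0.0, 0.0
    cumulative_s = float(below.sum()) * block_duration_s
    # An event starts wherever the binary parameter switches from 0 to 1.
    starts = int(below[0]) + int(np.count_nonzero(below[1:] & ~below[:-1]))
    events_per_second = starts / (below.size * block_duration_s)
    return cumulative_s, events_per_second
```

For example, the sequence `[1, 0, 0, 1, 0, 1]` with a threshold of 0.5 and blocks of 0.5 s contains two events with a cumulative length of 1.5 s.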

The FIG. 7 diagram summarizes the parameter.

The main advantage of this parameter is being able to effect measurements with no reference. Another benefit is the bit rate needed to transfer the reference, which reduces the reference to one real number, i.e. a bit rate of 1.5 kbit/s for 1024 signal samples (or even reduces it to one bit if a threshold is used, that is to say a bit rate of 47 bit/s). Note also that the algorithm is very simple and of reduced complexity and may therefore be installed in parallel with other parameters.
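The bit rates quoted for the reference can be checked with a short calculation. The sketch assumes that a real number is coded on 32 bits, which is not stated in the text but is consistent with the figures given, and a sampling frequency of 48 kHz.

```python
def reference_bit_rate(values_per_window, bits_per_value=32,
                       window_samples=1024, sample_rate_hz=48000):
    """Bit rate in bit/s needed to send the reference for every window.

    bits_per_value=32 is an assumption inferred from the quoted figures.
    """
    windows_per_second = sample_rate_hz / window_samples   # 46.875 at 48 kHz
    return values_per_window * bits_per_value * windows_per_second

one_real = reference_bit_rate(1)                    # 1500.0 bit/s = 1.5 kbit/s
one_bit = reference_bit_rate(1, bits_per_value=1)   # 46.875 bit/s, about 47 bit/s
five_reals = reference_bit_rate(5)                  # 7500.0 bit/s = 7.5 kbit/s
ten_reals = reference_bit_rate(10)                  # 15000.0 bit/s = 15 kbit/s
```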

The second method uses activity peak detection.

This parameter, just like the preceding one, is based on the activity of the signal. It detects loss of synchronization, breaks in the audio signal, cutting off of a portion of the audio signal and aberrant samples by looking for peaks in the activity of the signal.

Accordingly, this time, only the maxima for blocks of M samples are retained. There is no benefit in transmitting and then comparing all of the activity values if the objective is mainly to obtain a reduced reference method.

The parameter is therefore obtained at the time t from the following formula, in which y(t) is the activity of the signal calculated by the filter:

$\begin{matrix}{{{ActTemp}(t)} = {\max\limits_{k \in M}\left( {y\left( {t - k} \right)} \right)}} & (11)\end{matrix}$

In the case of a method using a reference, the same calculation is effected on the reference signal and on the degraded signal.

After synchronizing the two streams, comparing these activity maxima detects areas in which the signal has been disturbed.

To make this comparison, the ratio between the value measured for the reference and that obtained from the degraded signal shows up deterioration. It is possible to detect areas in which activity has been greatly reduced by choosing the maximum of the ratio and its inverse.

The following formula is used, in which ActTemp_(r)(t) and ActTemp_(d)(t) are respectively the parameter calculated for the reference and the parameter calculated from the degraded signal:

$\begin{matrix}{{d(t)} = {\max\left( {\frac{{ActTemp}_{d}(t)}{{ActTemp}_{r}(t)},\frac{{ActTemp}_{r}(t)}{{ActTemp}_{d}(t)}} \right)}} & (12)\end{matrix}$
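Equations (11) and (12) can be sketched as follows, taking a non-negative activity sequence as input. The function names, the block size and the small constant `eps` that protects the ratio against division by zero are assumptions added for the example.

```python
import numpy as np

def act_temp(y, block=1024):
    """Equation (11): maximum of the activity y over each block of samples."""
    y = np.asarray(y, dtype=float)
    n_blocks = len(y) // block
    return y[:n_blocks * block].reshape(n_blocks, block).max(axis=1)

def act_distance(act_ref, act_deg, eps=1e-12):
    """Equation (12): the larger of the ratio and its inverse, so that both
    added peaks and lost activity give a distance greater than 1.

    eps is a hypothetical floor that avoids dividing by zero in silence.
    """
    ref = np.maximum(np.asarray(act_ref, dtype=float), eps)
    deg = np.maximum(np.asarray(act_deg, dtype=float), eps)
    return np.maximum(deg / ref, ref / deg)
```

A distance of exactly 1 means that the two maxima are identical for that block.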

If the reference is not available, it is possible to use a threshold S′ and to detect if the parameter is above the threshold, which indicates the presence of deterioration. To prevent false detection caused by impulsive signals (sharp attack, percussive components), the threshold must have a relatively high value, which may lead to failure of detection.

As in the preceding situation, correspondence curves may be used to estimate perceived quality. The method consists in integrating the deterioration detected by this parameter with other deterioration found using the preceding parameter, for example, thereby obtaining a perceived global estimate.

The FIG. 8 diagram depicts the principle of this parameter.

As for the preceding parameter, the advantage of this parameter is that it is possible to achieve detection with no reference.

The reduced complexity and the low bit rate needed to transport the reference, limited to one value, i.e. to a bit rate of 1.5 kbit/s for 1024 signal samples sampled at 48 kHz (or even to one bit using a threshold, i.e. a bit rate of 47 bit/s) are also benefits.

The following method evaluates the minimum of the signal spectrum to locate deterioration.

It is mainly useful for detecting “impulsive” deterioration. It is important to note that most of the deterioration that occurs when transmitting an audio signal is of this type, very localized in time and very spread out in frequency. Accordingly, by treating it like wideband white noise in the signal, of very short duration, it is possible to detect it by analyzing the characteristics of the spectrum.

The first step of calculating these parameters is estimating the spectrum of the signal. To this end, the signal is divided into windows comprising blocks of N samples (N=1024 or 2048, for example), with an overlap of N/2 samples. This provides sufficient temporal resolution and analyzes the whole of the signal, taking account of the fact that the use of windowing greatly attenuates the influence of the edges of the time windows.

It also means that the calculation time at the installation stage is not excessively penalized. A fast Fourier transform is then used to change to the frequency domain.
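The spectrum estimation step described above can be sketched as follows. The Hann window, the function name and the small floor applied before taking the logarithm are assumptions of the sketch; the text only specifies blocks of N samples with an overlap of N/2 and a fast Fourier transform.

```python
import numpy as np

def spectra_db(x, n=1024, floor=1e-12):
    """Power spectra in dB for windows of n samples overlapping by n/2.

    Returns an array of shape (number of windows, n // 2 + 1).
    The signal must contain at least n samples.
    """
    x = np.asarray(x, dtype=float)
    hop = n // 2
    window = np.hanning(n)               # assumed window shape
    n_frames = 1 + (len(x) - n) // hop
    frames = np.stack([x[i * hop:i * hop + n] * window for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    return 10.0 * np.log10(np.maximum(power, floor))
```

With the 50% overlap, every sample away from the signal edges falls near the centre of at least one window, which is why the attenuation at the window edges does not hide part of the signal.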

The occurrence of deterioration raises the minimum of the spectrum because of the introduction of wideband white noise into all the frequency components of the spectrum. This is the basic principle behind the development of this parameter, which is simple to calculate using the following formula, in which x_(i) are the N components of the spectrum X in dB (obtained by remote calculation):

$\begin{matrix}{{MinSpe} = {\min\limits_{1 \leq i \leq N}\left( x_{i} \right)}} & (13)\end{matrix}$

In the case of methods using a reference, simple comparison after synchronizing the values obtained from the reference and from the degraded signal is generally insufficient to detect deterioration, because of the high variation of the minima obtained with a non-degraded signal.

Comparison must therefore be carried out by blocks of M values and in accordance with the following principle: for each block, only the maximum of the M minima obtained from the reference is retained, and provides a reference value for the initial noise level for the block. This value is compared to the M minima obtained from the degraded signal.

By retaining only the times at which the minima are increased, it is possible to detect the times at which noise is added to the signal.

The distance obtained for each moment t is therefore:

$\begin{matrix}{{d(t)} = {\max\left\{ {{{\min\limits_{i \in N}\left( {x_{d,i}(t)} \right)} - {\max\limits_{k \in M}\left\lbrack {\underset{i \in N}{\min_{\; k}}\left( {x_{r,i}(t)} \right)} \right\rbrack}},0} \right\}}} & (14)\end{matrix}$

where:

x_(r,i) is the i^(th) component of the N components of the spectrum obtained from the reference,

x_(d,i) is the i^(th) component of the N components of the spectrum obtained from the degraded signal, and

min_(k) is the k^(th) minimum of the M minima of the block concerned.
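The block-wise comparison of equation (14) can be sketched as follows, taking as input two arrays of spectra in dB with one row per time window. The function names and the block size of 8 are hypothetical; the text does not give a value for M.

```python
import numpy as np

def min_spe(spectra_db):
    """Equation (13): minimum of the spectrum components for each window."""
    return np.min(np.asarray(spectra_db, dtype=float), axis=1)

def min_spe_distance(ref_db, deg_db, block=8):
    """Equation (14): for each block of `block` windows, the maximum of the
    reference minima is the noise floor; only increases above it are kept."""
    ref_min = min_spe(ref_db)
    deg_min = min_spe(deg_db)
    n_blocks = min(len(ref_min), len(deg_min)) // block
    ref_blocks = ref_min[:n_blocks * block].reshape(n_blocks, block)
    deg_blocks = deg_min[:n_blocks * block].reshape(n_blocks, block)
    noise_floor = ref_blocks.max(axis=1, keepdims=True)
    return np.maximum(deg_blocks - noise_floor, 0.0).ravel()
```

Only the per-block maximum of the reference minima needs to be transmitted, i.e. one number per block.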

If the reference is not available, it is possible to use a mean value of the minima of the spectrum obtained previously by the algorithm. The remainder of the comparison is then effected in the same way.

As in the preceding situations, correspondence curves may be used by integrating the deterioration detected using this parameter with other deterioration to obtain a perceived measurement.

The two diagrams in FIG. 9 summarize the method.

Once again, the main advantage of these parameters is the ability to obtain measurements with no reference. Another benefit is the bit rate needed to transfer the reference. This reduces the reference to one real number or even one integer, i.e. a bit rate of at most 1.5 kbit/s for N signal samples (N=1024, for example). The reduced complexity of the algorithm is also a benefit.

In the next method, which analyses spectral flattening, two parameters SF₁ and SF₂ are used to estimate the “flattening” of the spectrum, for which the expression “statistical flattening” is sometimes used. These parameters evaluate the shape of the spectrum and its evolution along the sequence under study. If broadband noise appears in the signal, a continuous white noise type component causes flattening of the spectrum.

Parameter SF₁

When deterioration occurs, the components that had values close to zero before will have non-negligible values. The product of the spectrum components will therefore be greatly increased, whereas their sum will vary only a little. To exploit this, the spectrum flattening estimation parameter SF₁ is calculated from the following formula, in which X is the spectrum of the signal and x_(i) represents the components of the spectrum:

$\begin{matrix}{SF_{1} = 10 \cdot \log_{10}\left( \frac{{ArithmeticMean}(X)}{{GeometricMean}(X)} \right) = 10 \cdot \log_{10}\left( \frac{\frac{1}{N}\sum\limits_{i = 1}^{N}x_{i}}{\sqrt[N]{\prod\limits_{i = 1}^{N}x_{i}}} \right)} & (15)\end{matrix}$

This parameter is calculated in the same way for the reference and for the degraded signal. It is then possible to estimate the inserted white noise level, and consequently the deterioration, by means of a comparison.
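Equation (15) can be sketched as follows. The sketch assumes that the components x_(i) are linear (strictly positive) power values, since a geometric mean is only meaningful for positive numbers; the function name and the floor are hypothetical, and the geometric mean is computed through logarithms to avoid overflow of the product.

```python
import numpy as np

def sf1(power_spectrum, floor=1e-12):
    """Equation (15): 10*log10(arithmetic mean / geometric mean).

    Returns 0 dB for a perfectly flat spectrum and larger values for a
    spectrum dominated by a few strong components.
    """
    x = np.maximum(np.asarray(power_spectrum, dtype=float), floor)
    arithmetic_mean = x.mean()
    geometric_mean = np.exp(np.mean(np.log(x)))
    return float(10.0 * np.log10(arithmetic_mean / geometric_mean))
```

When wideband noise lifts the weak components, the geometric mean rises much faster than the arithmetic mean, so the value drops toward 0 dB.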

Parameter SF₂

The statistical flattening coefficient known as “kurtosis” or “concentration” is used to calculate this parameter. The estimate is based on 2^(nd) and 4^(th) order centered moments. These enable the shape of the spectrum to be estimated relative to a normal distribution (in the statistical sense).

The calculation corresponds to the ratio of the 4^(th) order centered moment to the square of the 2^(nd) order centered moment (variance) of the coefficients of the spectrum. The formula used is as follows:

$\begin{matrix}{SF_{2} = \frac{m_{4}(X)}{m_{2}^{2}(X)} = \frac{m_{4}(X)}{\delta^{4}} = N \cdot \frac{\sum\limits_{i = 1}^{N}\left( x_{i} - \overset{\_}{X} \right)^{4}}{\left( \sum\limits_{i = 1}^{N}\left( x_{i} - \overset{\_}{X} \right)^{2} \right)^{2}}} & (16)\end{matrix}$

with centered moments m_(k) defined by the equation:

$\begin{matrix}{m_{k} = \frac{\sum\limits_{i = 1}^{N}\left( \;{x_{i} - \overset{\_}{X}} \right)^{k}}{N}} & (17)\end{matrix}$

in which X̄ is the arithmetic mean of the N components x_(i) of the spectrum X in dB.

As with the parameter SF₁, the higher the value obtained, the more concentrated the signal and the less noise there is in the signal. The parameter SF₂ is calculated for the reference and for the degraded signal. The inserted white noise level is estimated by comparison.
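Equations (16) and (17) translate directly into the following sketch. The function name and the handling of a perfectly constant spectrum (for which the variance is zero and the ratio is undefined) are assumptions of the example.

```python
import numpy as np

def sf2(spectrum_db):
    """Equation (16): ratio of the 4th order centered moment to the square
    of the 2nd order centered moment of the spectrum components in dB."""
    x = np.asarray(spectrum_db, dtype=float)
    centered = x - x.mean()
    m2 = np.mean(centered ** 2)          # equation (17) with k = 2
    m4 = np.mean(centered ** 4)          # equation (17) with k = 4
    if m2 == 0.0:                        # constant spectrum: ratio undefined
        return float("nan")
    return float(m4 / m2 ** 2)
```

For the components `[0, 0, 0, 4]` the moments are m₂ = 3 and m₄ = 21, which gives a value of 21/9, the same result as the right-hand form N·Σ(·)⁴/(Σ(·)²)² = 4·84/144.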

The FIG. 10 diagram depicts this principle, which is valid for both the above parameters.

In the case of comparison with the reference, a simple distance, of the difference type or of another type, is sufficient for detecting deterioration. If no reference is available, it is necessary to look for deterioration by detecting peaks in the variation of the parameters. This may be done using the standard grey level mathematical morphology technique (erosions and dilations) used in the image processing field.
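One possible way of applying grey level morphology to a one-dimensional sequence of parameter variations is sketched below. The choice of a top-hat operation (the sequence minus its opening, i.e. an erosion followed by a dilation), the flat structuring element, its width and the function names are all assumptions of this sketch; the text does not specify which combination of erosions and dilations is used.

```python
import numpy as np

def _sliding(values, width, reducer):
    """Apply reducer (np.min or np.max) over a centered window of odd width."""
    half = width // 2
    padded = np.pad(values, half, mode="edge")
    windows = np.lib.stride_tricks.sliding_window_view(padded, width)
    return reducer(windows, axis=1)

def erosion(values, width):
    return _sliding(values, width, np.min)

def dilation(values, width):
    return _sliding(values, width, np.max)

def peak_excess(values, width=5):
    """Top-hat: what remains of the sequence after removing its opening.
    Peaks narrower than `width` stand out; slow variations are cancelled."""
    values = np.asarray(values, dtype=float)
    opening = dilation(erosion(values, width), width)
    return values - opening
```

Comparing the output with a threshold would then flag the positions of isolated peaks without needing any reference.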

The advantages and limitations of these parameters are identical to those of the preceding parameters: the necessary bit rate is limited and using no reference is possible, as is using correspondence curves to estimate the perceived magnitude of the deterioration.

In the context of monitoring a digital television broadcast network, the reference audio signal corresponds to the signal at the input of the broadcast network. The reference parameters are calculated for this signal and then sent over a dedicated channel to the required measurement point, at which the same parameters, needed for the comparison for establishing reduced reference measurements, are calculated. Measurements with no reference are also calculated. If the reference parameters are not available (not present, erroneous, etc.), these measurements are sufficient for detecting more serious errors. The subsystems shown in dashed lines in FIG. 11 are then no longer used.

The measurements obtained with no reference and the reduced reference measurements (obtained when it has been possible to calculate them) are used by a model for estimating the magnitude of the deterioration induced by broadcasting the signals.

The FIG. 11 diagram summarizes this embodiment:

Thus a plurality of measurement points may be established. Once these estimates of the deterioration have been obtained, it is a simple matter to send them to a network monitoring centre which provides an overview of network performance.

The same diagram as before may then be used to visualize Internet radio broadcast performance (with or without a reference). In this case, the data channel used to transport the reference parameters may be the network itself, in exactly the same way as for returning estimated scores to the monitoring centre. The reference signal corresponds to the signal sent by the server and the degraded signal is that decoded at the chosen measurement point. For example, it is possible to choose the most appropriate server as a function of the connection point by accessing monitoring centre data. The next diagram (FIG. 12) depicts this embodiment in the situation in which reference parameters are sent by the network and the scores obtained are sent over a dedicated channel.

A method of the invention may be applied whenever it is necessary to identify defects in an audio signal transmitted over any broadcast network (cable, satellite, microwave, Internet, DVB, DAB, etc.).

The process proposed uses two classes of methods: reduced reference techniques and techniques with no reference. It is of particular benefit when the bit rate available for transmitting the reference is limited.

Accordingly, the invention is applicable to operating metrology equipment and audio signal distribution network supervisory systems. One of its advantageous features is to combine measurements effected with and without a reference. Finally, the invention conforms to the requirements of quality of service management systems.

1. A method of qualitatively evaluating a digital audio signal, comprising: calculating, using a measuring system, in real time, in continuous time, and in successive time windows, a quality indicator, wherein said calculating further comprises: a) calculating a temporal activity of the digital audio signal in each of said time windows, b) calculating a sliding average over N₁ successive values of the temporal activity, and c) retaining a minimum value of M₁ successive values of the sliding average, and wherein: said quality indicator is obtained from said digital audio signal that represents an analog audio signal, said quality indicator is associated with each of said time windows, and said quality indicator comprises a number of elements which is at least one hundred times less than the number of audio samples in a time window, said number being from 1 to 10; and directly estimating quality of said digital audio signal as a function of said quality indicator.
2. A method according to claim 1, wherein said quality indicator comprises said minimum value.
3. A method according to claim 1, wherein said quality indicator comprises a binary value that is the result of comparing said minimum value with a given threshold.
4. A method according to claim 1, including calculating a quality score by determining a cumulative time interval during which said minimum value is below a given threshold S₁ or by determining the number of times per second said minimum value is below a given threshold S′₁ or by determining both said cumulative time interval and the number of times per second.
5. A method according to claim 1, wherein said minimum values are generated at the same time for a reference audio signal and for the digital audio signal to be evaluated and a quality is generated by comparing the corresponding minimum values for the reference audio signal and for the audio signal to be evaluated.